
Avro metrics support: create MetricsAwareDatumWriter and some refactors for Avro #1946

Merged
merged 2 commits into apache:master on Feb 3, 2021

Conversation

yyanyy
Contributor

@yyanyy yyanyy commented Dec 16, 2020

This change is a smaller PR broken down from #1935. There is no change in behavior from this PR. It covers the following:

  • add a comparator for byte arrays
  • update the field metrics bound type so that object-to-ByteBuffer translation can happen later
  • add a metrics() method to ValueWriter for Avro, currently defaulting to an empty stream
  • create MetricsAwareDatumWriter, which exposes writer metrics, and replace DatumWriter with it in various classes
  • add metrics config to the Avro writer and builder
  • create an AvroMetrics class that preserves the current behavior for producing metrics for the Avro writer
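The shape of the change can be sketched in miniature. This is a hypothetical, standalone illustration of the pattern (the names `MetricsAwareWriter` and `CountingWriter` and the simplified `FieldMetrics` are assumptions for this sketch, not Iceberg's actual classes): writers expose a stream of per-field metrics, and writers that do not track anything inherit an empty default.

```java
import java.util.stream.Stream;

// Simplified stand-in for Iceberg's FieldMetrics: per-field stats collected
// while writing. The real class carries counts and bounds per field id.
class FieldMetrics {
  private final int fieldId;
  private final long valueCount;

  FieldMetrics(int fieldId, long valueCount) {
    this.fieldId = fieldId;
    this.valueCount = valueCount;
  }

  int fieldId() { return fieldId; }
  long valueCount() { return valueCount; }
}

// Hypothetical metrics-aware writer interface: write() does the work,
// metrics() exposes whatever stats the writer tracked, empty by default.
interface MetricsAwareWriter<T> {
  void write(T datum);

  default Stream<FieldMetrics> metrics() {
    return Stream.empty();
  }
}

// A leaf writer that overrides the default and reports a value count.
class CountingWriter implements MetricsAwareWriter<String> {
  private long count = 0;

  @Override
  public void write(String datum) {
    count++;
  }

  @Override
  public Stream<FieldMetrics> metrics() {
    return Stream.of(new FieldMetrics(1, count));
  }
}
```

A parent writer can then merge its children's streams with `Stream.concat`, which is how the composite writers in this PR aggregate metrics upward.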

@rdblue
Contributor

rdblue commented Dec 17, 2020

Thanks @yyanyy, I'll take a look at this one soon!

private final ByteBuffer lowerBound;
private final ByteBuffer upperBound;
private final Object lowerBound;
private final Object upperBound;
Contributor

Why change this to Object rather than ByteBuffer? It seems like conversion to ByteBuffer would be cleaner if it were done in each writer, since the writer already knows its type (it is going to call the right method on the encoder).

Contributor Author

I think if we convert this to ByteBuffer now, we may still need to check the type when doing truncation (based on metrics mode), and I think a string with non-ASCII characters will not produce the same result if truncated by BinaryUtil.truncateBinary. So we would either have to convert the byte buffer back to a char sequence and use UnicodeUtil.truncateString, or create a new BinaryUtil.truncateString. Whereas if we do the conversion later, when evaluating metrics, the code needed for the conversion itself isn't that bad since we know the type of the field, and that's the reason for this change.

But one thing that may be worth noting is that with the current approach, in order to ensure that Conversions.toByteBuffer works, for certain writers I have to make sure the min/max returned by the value writers are of a type that Conversions.toByteBuffer knows how to translate whenever the data type used in write is not of that type (that is, this usage of the method). I think we would still need to maintain a similar translation function in each value writer if we return ByteBuffer for bounds in field metrics, but it would translate the input data type directly to a byte buffer instead of doing two hops, and that might be easier to understand.
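The truncation concern above can be demonstrated in isolation. This is a hypothetical sketch (the helper `binaryTruncate` is invented for illustration; it is not BinaryUtil.truncateBinary): cutting a UTF-8 byte sequence at an arbitrary byte offset can split a multi-byte character, which character-level truncation would never do.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Demonstrates why byte-level and char-level truncation disagree for
// non-ASCII strings. 'é' encodes to two bytes (0xC3 0xA9) in UTF-8, so
// truncating "aé" to 2 bytes cuts the character in half; decoding the
// result replaces the dangling lead byte with U+FFFD.
class TruncationDemo {
  static String binaryTruncate(String s, int maxBytes) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    byte[] cut = Arrays.copyOf(utf8, Math.min(maxBytes, utf8.length));
    // String's decoder substitutes U+FFFD for malformed input.
    return new String(cut, StandardCharsets.UTF_8);
  }
}
```

Character-aware truncation (`"aé".substring(0, 1)`) yields a valid string `"a"`, while the byte-level cut yields a corrupted one, which is the reason given above for deferring the Object-to-ByteBuffer conversion until after truncation.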

Contributor

I guess I was thinking that truncation would happen when FieldMetrics is constructed, in the leaf writers. If that's not the case, then I think it makes sense to do the conversion later.

If the conversion happens later, then I think this class should be parameterized. I never like to have classes that track just Object. We should at least guarantee that both lower and upper bounds are the same type, for example.
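The parameterization suggested here could look like the following sketch (the class name `Bounds` and its accessors are illustrative assumptions, not the final Iceberg code): typing the class over `T` guarantees at compile time that the lower and upper bounds are the same type, instead of tracking raw `Object`.

```java
// Hypothetical parameterized bounds holder: both bounds share one type
// parameter, so a caller cannot mix, say, an Integer lower bound with a
// CharSequence upper bound the way an Object-typed class would allow.
class Bounds<T> {
  private final T lowerBound;
  private final T upperBound;

  Bounds(T lowerBound, T upperBound) {
    this.lowerBound = lowerBound;
    this.upperBound = upperBound;
  }

  T lowerBound() { return lowerBound; }
  T upperBound() { return upperBound; }
}
```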

Contributor Author

I was mostly following the pattern of ORC and Parquet, which evaluate the metrics mode when collecting metrics (which has to be the case there, since those file formats collect stats themselves). But I think there's nothing preventing us from taking the metrics mode into account during value writer creation; it will just make the visitor pattern a little more complicated. I'll give it a try, and thanks for bringing up this idea!

I guess for now I'll revert the change to FieldMetrics in this PR and include it in the next one that updates the value writers, if we still need to change it. Hopefully that doesn't add too much to the next PR!

@@ -157,6 +157,10 @@ public int compare(List<T> o1, List<T> o2) {
return UnsignedByteBufComparator.INSTANCE;
}

public static Comparator<byte[]> unsignedByteArray() {
Contributor

Other method names are plural. Could we use unsignedByteArrays()?
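An unsigned, lexicographic byte[] comparator of the kind this diff adds can be sketched as follows (illustrative only; the real implementation lives in Iceberg's Comparators class and may differ in detail). The key points are masking each byte with `0xFF` so it compares as an unsigned value in 0..255, and breaking ties on length, so the shorter array sorts first.

```java
import java.util.Comparator;

// Sketch of an unsigned lexicographic comparator for byte arrays.
class UnsignedByteArrays {
  static final Comparator<byte[]> UNSIGNED = (a, b) -> {
    int len = Math.min(a.length, b.length);
    for (int i = 0; i < len; i++) {
      // Java bytes are signed; masking with 0xFF compares them as
      // unsigned values, so (byte) 0xFF sorts after 0x01.
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    // If there are no differences, the shorter array is smaller.
    return Integer.compare(a.length, b.length);
  };
}
```

Without the mask, `(byte) 0xFF` would compare as -1 and incorrectly sort before `0x01`, which is exactly the bug an "unsigned" comparator exists to avoid.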

@@ -123,7 +127,7 @@ public WriteBuilder named(String newName) {
return this;
}

public WriteBuilder createWriterFunc(Function<Schema, DatumWriter<?>> writerFunction) {
public WriteBuilder createWriterFunc(Function<Schema, MetricsAwareDatumWriter<?>> writerFunction) {
Contributor

This is going to break existing uses of createWriterFunc in projects that build on Iceberg. I think this should keep the old parameter and just check whether the implementation is MetricsAwareDatumWriter in the appender to return metrics.
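The backward-compatible approach suggested here can be sketched as follows (these interfaces and the helper `metricsOf` are stand-ins invented for illustration, not Iceberg's actual API): keep accepting the old writer type, and only read metrics when the implementation happens to be metrics-aware.

```java
import java.util.stream.Stream;

// Stand-in for the old writer interface that external projects implement.
interface Writer<T> {
  void write(T datum);
}

// Stand-in for the new metrics-exposing interface; String stands in for
// FieldMetrics to keep the sketch self-contained.
interface MetricsAware {
  Stream<String> metrics();
}

// A writer built against the new interface contributes metrics.
class NewStyleWriter implements Writer<String>, MetricsAware {
  @Override
  public void write(String datum) { }

  @Override
  public Stream<String> metrics() {
    return Stream.of("field-1-metrics");
  }
}

class MetricsHelper {
  // The appender checks at runtime whether the writer is metrics-aware.
  // Legacy implementations that predate the interface still work; they
  // simply contribute no metrics.
  static Stream<String> metricsOf(Writer<?> writer) {
    if (writer instanceof MetricsAware) {
      return ((MetricsAware) writer).metrics();
    }
    return Stream.empty();
  }
}
```

This keeps the `createWriterFunc` signature unchanged for downstream projects while letting Iceberg's own writers opt in to metrics.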

Contributor Author

Sounds good, I didn't think of the case where people have their own implementations of these interfaces, so I totally missed this. Will update and keep it in mind!

@github-actions github-actions bot added the data label Dec 18, 2020
@rdblue
Contributor

rdblue commented Dec 28, 2020

@yyanyy, can you rebase this to fix the conflict?

@yyanyy yyanyy requested a review from rdblue January 5, 2021 22:41
Contributor

@jackye1995 jackye1995 left a comment

Thanks for rebasing, it looks good to me

}
}

// if there are no differences, then the shorter seq is first
Contributor

nit: "the shorter seq is first" is a bit confusing to me, maybe "is smaller" is a better word.


@Override
public Stream<FieldMetrics> metrics() {
return Stream.concat(PATH_WRITER.metrics(), POS_WRITER.metrics());
Contributor

This should also include metrics from the rowWriter, right?

Contributor

This can be fixed in a follow-up.
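The follow-up fix discussed above amounts to nesting a second `Stream.concat`, since `Stream.concat` takes exactly two arguments. A minimal sketch (the method and parameter names are illustrative, not the actual field names in the delete-file writer):

```java
import java.util.stream.Stream;

// Merging metrics from all three child writers, not just path and position.
class MetricsConcat {
  static Stream<String> metrics(Stream<String> pathMetrics,
                                Stream<String> posMetrics,
                                Stream<String> rowMetrics) {
    // Stream.concat is binary, so three streams require nesting.
    return Stream.concat(pathMetrics, Stream.concat(posMetrics, rowMetrics));
  }
}
```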

CodecFactory codec, Map<String, String> metadata) throws IOException {
DataFileWriter<D> writer = new DataFileWriter<>(
(DatumWriter<D>) createWriterFunc.apply(schema));
(DatumWriter<D>) metricsAwareDatumWriter);
Contributor

Minor: I don't think this needs to be a MetricsAwareDatumWriter, right? It isn't in the type signature, so we should name it just datumWriter.

@rdblue rdblue merged commit 6cc5d99 into apache:master Feb 3, 2021
@rdblue
Contributor

rdblue commented Feb 3, 2021

Thanks, @yyanyy! This looks good now so I merged it. That should unblock the next steps.
